Abstract

We consider concurrent games played on graphs. At every round of the game, each player
simultaneously and independently selects a move; the moves jointly determine the transition
to a successor state. Two basic objectives are the safety objective, "stay forever in a set F of
states", and its dual, the reachability objective, "reach a set R of states". We present in this
paper a strategy improvement algorithm for computing the value of a concurrent safety game,
that is, the maximal probability with which player 1 can enforce the safety objective. The
algorithm yields a sequence of player-1 strategies which ensure probabilities of winning that
converge monotonically to the value of the safety game.

The significance of the result is twofold. First, while strategy improvement algorithms were
known for Markov decision processes and turn-based games, as well as for concurrent reachability
games, this is the first strategy improvement algorithm for concurrent safety games. Second,
and most importantly, the improvement algorithm provides a way to approximate the value
of a concurrent safety game from below (the known value-iteration algorithms approximate
the value from above). Thus, when used together with value-iteration algorithms, or with
strategy improvement algorithms for reachability games, our algorithm leads to the first practical
algorithm for computing converging upper and lower bounds for the value of reachability and
safety games.



1 Introduction 

We consider games played between two players on graphs. At every round of the game, each of the 
two players selects a move; the moves of the players then determine the transition to the successor 
state. A play of the game gives rise to a path on the graph. We consider two basic goals for the 
players: reachability, and safety. In the reachability goal, player 1 must reach a set of target states 
or, if randomization is needed to play the game, then player 1 must maximize the probability of 
reaching the target set. In the safety goal, player 1 must ensure that a set of target states is never 
left or, if randomization is required, then player 1 must ensure that the probability of leaving the 
target set is as low as possible. The two goals are dual, and the games are determined: the maximal 
probability with which player 1 can reach a target set is equal to one minus the maximal probability 
with which player 2 can confine the game in the complement set [18]. 

These games on graphs can be divided into two classes: turn-based and concurrent. In
turn-based games, only one player has a choice of moves at each state; in concurrent games, at each
state both players choose a move, simultaneously and independently, from a set of available moves.

For turn-based games, the solution of games with reachability and safety goals has long been
known. If the move played determines uniquely the successor state, the games can be solved in
linear time in the size of the game graph. If the move played determines a probability distribution
over the successor state, the problem of deciding whether a safety or reachability game can be won
with probability greater than $p \in [0,1]$ is in NP $\cap$ co-NP [5], and the exact value of a game can be
computed by strategy improvement algorithms [6]. These results all hinge on the fact that
turn-based reachability and safety games can be optimally won with deterministic, and memoryless,
strategies. These strategies are functions from states to moves, so they are finite in number, which
guarantees the termination of the algorithms.

The situation is different for the concurrent case, where randomization is needed even in the 
case in which the moves played by the players uniquely determine the successor state. The value of 
the game is defined, as usual, as the sup-inf value: the supremum, over all strategies of player 1, of 
the infimum, over all strategies of player 2, of the probability of achieving the safety or reachability 
goal. In concurrent reachability games, players are only guaranteed the existence of $\varepsilon$-optimal
strategies, that ensure that the value of the game is achieved within a specified $\varepsilon > 0$ [17]; these
strategies (which depend on $\varepsilon$) are memoryless, but in general need randomization [10]. However,
for concurrent safety games memoryless optimal strategies exist [11]. Thus, these strategies are
mappings from states to probability distributions over moves.

While complexity results are available for the solution of concurrent reachability and safety 
games, practical algorithms for their solution, that can provide both a value, and an estimated 
error, have so far been lacking. The question of whether the value of a concurrent reachability or
safety game is at least $p \in [0,1]$ can be decided in PSPACE via a reduction to the theory of the real
closed field [13]. This yields a binary-search algorithm to approximate the value. This approach is
of theoretical interest, but impractical because the decision algorithms for the theory of reals are complex.

Thus far, the only practical approach to the solution of concurrent safety and reachability games 
has been via value iteration, and via strategy improvement for reachability games. In [11] it was 
shown how to construct a series of valuations that approximates from below, and converges, to 
the value of a reachability game; the same algorithm provides valuations converging from above 
to the value of a safety game. In [4], it was shown how to construct a series of strategies for 
reachability games that converge towards optimality. Neither scheme is guaranteed to terminate, 
not even strategy improvement, since in general only $\varepsilon$-optimal strategies are guaranteed to exist.
Both of these approximation schemes lead to practical algorithms. The problem with both schemes, 
however, is that they provide only lower bounds for the value of reachability games, and only upper 
bounds for the value of safety games. As no bounds are available for the speed of convergence of 
these algorithms, the question of how to derive the matching bounds has so far been open. 

In this paper, we present the first strategy improvement algorithm for the solution of concurrent
safety games. Given a safety goal for player 1, the algorithm computes a sequence of memoryless,
randomized strategies $\pi_1^0, \pi_1^1, \pi_1^2, \ldots$ for player 1 that converge towards optimality. Although
memoryless randomized optimal strategies exist for safety goals [11], the strategy improvement algorithm
may not converge in finitely many iterations: indeed, optimal strategies may require moves to be
played with irrational probabilities, while the strategies produced by the algorithm play moves with
probabilities that are rational numbers. The main significance of the algorithm is that it provides
a converging sequence of lower bounds for the value of a safety game, and dually, of upper bounds
for the value of a reachability game. To obtain such bounds, it suffices to compute the value $v_k(s)$
provided by $\pi_1^k$ at a state $s$, for $k \geq 0$. Once $\pi_1^k$ is fixed, the game is reduced to a Markov decision
process, and the value $v_k(s)$ of the safety game can be computed at all $s$, e.g., via linear programming
[7, 3]. Thus, together with the value or strategy improvement algorithms of [11, 4], the algorithm
presented in this paper provides the first practical way of computing converging lower and upper
bounds for the values of concurrent reachability and safety games. We also present a detailed
analysis of termination criteria for turn-based stochastic games, and obtain an improved upper bound
for termination for turn-based stochastic games.

The strategy improvement algorithm for reachability games of [4] is based on locally improving 
the strategy on the basis of the valuation it yields. This approach does not suffice for safety 
games: the sequence of strategies obtained would yield increasing values to player 1, but these
values would not necessarily converge to the value of the game. In this paper, we introduce a
novel, and non-local, improvement step, which augments the standard value-based improvement
step. The non-local step involves the analysis of an appropriately-constructed turn-based game.
As value iteration for safety games converges from above, while our sequence of strategies yields
values that converge from below, the proof of convergence for our algorithm cannot be derived from
a connection with value iteration, as was the case for reachability games. Thus, we developed new 
proof techniques to show both the monotonicity of the strategy values produced by our algorithm, 
and to show convergence to the value of the game. 

2 Definitions 

Notation. For a countable set $A$, a probability distribution on $A$ is a function $\delta : A \to [0,1]$
such that $\sum_{a \in A} \delta(a) = 1$. We denote the set of probability distributions on $A$ by $\mathcal{D}(A)$. Given a
distribution $\delta \in \mathcal{D}(A)$, we denote by $\mathrm{Supp}(\delta) = \{x \in A \mid \delta(x) > 0\}$ the support set of $\delta$.

Definition 1 (Concurrent games) A (two-player) concurrent game structure
$G = (S, M, \Gamma_1, \Gamma_2, \delta)$ consists of the following components:

• A finite state space $S$ and a finite set $M$ of moves or actions.

• Two move assignments $\Gamma_1, \Gamma_2 : S \to 2^M \setminus \emptyset$. For $i \in \{1,2\}$, assignment $\Gamma_i$ associates with
each state $s \in S$ a nonempty set $\Gamma_i(s) \subseteq M$ of moves available to player $i$ at state $s$.

• A probabilistic transition function $\delta : S \times M \times M \to \mathcal{D}(S)$ that gives the probability
$\delta(s, a_1, a_2)(t)$ of a transition from $s$ to $t$ when player 1 chooses at state $s$ move $a_1$ and player 2
chooses move $a_2$, for all $s, t \in S$ and $a_1 \in \Gamma_1(s)$, $a_2 \in \Gamma_2(s)$.

We denote by $|\delta|$ the size of the transition function, i.e.,
\[
|\delta| = \sum_{s \in S} \sum_{a \in \Gamma_1(s)} \sum_{b \in \Gamma_2(s)} \sum_{t \in S} |\delta(s,a,b)(t)|,
\]
where $|\delta(s,a,b)(t)|$ is the number of bits required to specify the transition probability $\delta(s,a,b)(t)$. We
denote by $|G|$ the size of the game graph, and $|G| = |\delta| + |S|$. At every state $s \in S$, player 1 chooses
a move $a_1 \in \Gamma_1(s)$, and simultaneously and independently player 2 chooses a move $a_2 \in \Gamma_2(s)$. The
game then proceeds to the successor state $t$ with probability $\delta(s,a_1,a_2)(t)$, for all $t \in S$. A state
$s$ is an absorbing state if for all $a_1 \in \Gamma_1(s)$ and $a_2 \in \Gamma_2(s)$, we have $\delta(s,a_1,a_2)(s) = 1$. In other
words, at an absorbing state $s$, for all choices of moves of the two players, the successor state is
always $s$.
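To make Definition 1 concrete, the following is a minimal sketch of one possible encoding of a concurrent game structure, together with a check for absorbing states. The dictionary layout (`moves1`, `moves2`, `delta`), the function names, and the toy game are illustrative assumptions and do not come from the paper.

```python
from fractions import Fraction

# Hypothetical encoding (an assumption, not the paper's): moves1[s] and moves2[s]
# list the moves available to players 1 and 2 at s, and delta[(s, a1, a2)] maps
# each successor state to its transition probability.

def is_well_formed(states, moves1, moves2, delta):
    """Check nonempty move sets and that each delta(s, a1, a2) is a distribution."""
    for s in states:
        if not moves1[s] or not moves2[s]:
            return False
        for a1 in moves1[s]:
            for a2 in moves2[s]:
                dist = delta[(s, a1, a2)]
                if any(t not in states or p < 0 for t, p in dist.items()):
                    return False
                if sum(dist.values()) != 1:
                    return False
    return True

def is_absorbing(s, moves1, moves2, delta):
    """True if every pair of available moves at s leads back to s with probability 1."""
    return all(
        delta[(s, a1, a2)].get(s, 0) == 1
        for a1 in moves1[s]
        for a2 in moves2[s]
    )

# Toy two-state game, invented for illustration only.
STATES = {"s", "t"}
MOVES1 = {"s": ["a", "b"], "t": ["a"]}
MOVES2 = {"s": ["a", "b"], "t": ["a"]}
DELTA = {
    ("s", "a", "a"): {"s": Fraction(1)},
    ("s", "b", "b"): {"s": Fraction(1)},
    ("s", "a", "b"): {"t": Fraction(1)},
    ("s", "b", "a"): {"s": Fraction(1, 2), "t": Fraction(1, 2)},
    ("t", "a", "a"): {"t": Fraction(1)},
}
```

Exact rationals are used so that the check "probabilities sum to 1" is not affected by floating-point rounding.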



Definition 2 (Turn-based stochastic games) A turn-based stochastic game graph ($2\frac{1}{2}$-player
game graph) $G = ((S,E), (S_1, S_2, S_R), \delta)$ consists of a finite directed graph $(S,E)$, a partition
$(S_1, S_2, S_R)$ of the finite set $S$ of states, and a probabilistic transition function $\delta : S_R \to \mathcal{D}(S)$,
where $\mathcal{D}(S)$ denotes the set of probability distributions over the state space $S$. The states in $S_1$
are the player-1 states, where player 1 decides the successor state; the states in $S_2$ are the player-2
states, where player 2 decides the successor state; and the states in $S_R$ are the random or
probabilistic states, where the successor state is chosen according to the probabilistic transition function $\delta$.
We assume that for $s \in S_R$ and $t \in S$, we have $(s,t) \in E$ iff $\delta(s)(t) > 0$, and we often write $\delta(s,t)$
for $\delta(s)(t)$. For technical convenience we assume that every state in the graph $(S,E)$ has at least
one outgoing edge. For a state $s \in S$, we write $E(s)$ to denote the set $\{t \in S \mid (s,t) \in E\}$ of
possible successors. We denote by $|\delta|$ the size of the transition function, i.e.,
$|\delta| = \sum_{s \in S_R} \sum_{t \in S} |\delta(s)(t)|$,
where $|\delta(s)(t)|$ is the number of bits required to specify the transition probability $\delta(s)(t)$. We denote
by $|G|$ the size of the game graph, and $|G| = |\delta| + |S| + |E|$.

Plays. A play $\omega$ of $G$ is an infinite sequence $\omega = \langle s_0, s_1, s_2, \ldots \rangle$ of states in $S$ such that for all
$k \geq 0$, there are moves $a_1^k \in \Gamma_1(s_k)$ and $a_2^k \in \Gamma_2(s_k)$ with $\delta(s_k, a_1^k, a_2^k)(s_{k+1}) > 0$. We denote by $\Omega$
the set of all plays, and by $\Omega_s$ the set of all plays $\omega = \langle s_0, s_1, s_2, \ldots \rangle$ such that $s_0 = s$, that is, the
set of plays starting from state $s$.

Selectors and strategies. A selector $\xi$ for player $i \in \{1,2\}$ is a function $\xi : S \to \mathcal{D}(M)$ such
that for all states $s \in S$ and moves $a \in M$, if $\xi(s)(a) > 0$, then $a \in \Gamma_i(s)$. A selector $\xi$ for player
$i$ at a state $s$ is a distribution over moves such that if $\xi(s)(a) > 0$, then $a \in \Gamma_i(s)$. We denote
by $\Lambda_i$ the set of all selectors for player $i \in \{1,2\}$, and similarly, we denote by $\Lambda_i(s)$ the set of all
selectors for player $i$ at a state $s$. The selector $\xi$ is pure if for every state $s \in S$, there is a move
$a \in M$ such that $\xi(s)(a) = 1$. A strategy for player $i \in \{1,2\}$ is a function $\pi : S^+ \to \mathcal{D}(M)$ that
associates with every finite, nonempty sequence of states, representing the history of the play so
far, a selector for player $i$; that is, for all $w \in S^*$ and $s \in S$, we have $\mathrm{Supp}(\pi(w \cdot s)) \subseteq \Gamma_i(s)$.
The strategy $\pi$ is pure if it always chooses a pure selector; that is, for all $w \in S^+$, there is a
move $a \in M$ such that $\pi(w)(a) = 1$. A memoryless strategy is independent of the history of the
play and depends only on the current state. Memoryless strategies correspond to selectors; we
write $\overline{\xi}$ for the memoryless strategy consisting in playing forever the selector $\xi$. A strategy is pure
memoryless if it is both pure and memoryless. In a turn-based stochastic game, a strategy for
player 1 is a function $\pi_1 : S^* \cdot S_1 \to \mathcal{D}(S)$, such that for all $w \in S^*$ and for all $s \in S_1$ we have
$\mathrm{Supp}(\pi_1(w \cdot s)) \subseteq E(s)$. Memoryless strategies and pure memoryless strategies are obtained as
the restriction of strategies as in the case of concurrent game graphs. The family of strategies for
player 2 is defined analogously. We denote by $\Pi_1$ and $\Pi_2$ the sets of all strategies for player 1
and player 2, respectively. We denote by $\Pi_i^M$ and $\Pi_i^{PM}$ the sets of memoryless strategies and pure
memoryless strategies for player $i$, respectively.

Destinations of moves and selectors. For all states $s \in S$ and moves $a_1 \in \Gamma_1(s)$ and
$a_2 \in \Gamma_2(s)$, we indicate by $\mathrm{Dest}(s,a_1,a_2) = \mathrm{Supp}(\delta(s,a_1,a_2))$ the set of possible successors of $s$ when
the moves $a_1$ and $a_2$ are chosen. Given a state $s$, and selectors $\xi_1$ and $\xi_2$ for the two players, we
denote by
\[
\mathrm{Dest}(s,\xi_1,\xi_2) = \bigcup_{a_1 \in \mathrm{Supp}(\xi_1(s)),\ a_2 \in \mathrm{Supp}(\xi_2(s))} \mathrm{Dest}(s,a_1,a_2)
\]
the set of possible successors of $s$ with respect to the selectors $\xi_1$ and $\xi_2$.

Once a starting state $s$ and strategies $\pi_1$ and $\pi_2$ for the two players are fixed, the game is
reduced to an ordinary stochastic process. Hence, the probabilities of events are uniquely defined,
where an event $\mathcal{A} \subseteq \Omega_s$ is a measurable set of plays. For an event $\mathcal{A} \subseteq \Omega_s$, we denote by $\Pr_s^{\pi_1,\pi_2}(\mathcal{A})$
the probability that a play belongs to $\mathcal{A}$ when the game starts from $s$ and the players follow the
strategies $\pi_1$ and $\pi_2$. Similarly, for a measurable function $f : \Omega_s \to \mathbb{R}$, we denote by $\mathrm{E}_s^{\pi_1,\pi_2}(f)$ the
expected value of $f$ when the game starts from $s$ and the players follow the strategies $\pi_1$ and $\pi_2$.
For $i \geq 0$, we denote by $\Theta_i : \Omega \to S$ the random variable denoting the $i$-th state along a play.

Valuations. A valuation is a mapping $v : S \to [0,1]$ associating a real number $v(s) \in [0,1]$ with
each state $s$. Given two valuations $v, w : S \to \mathbb{R}$, we write $v \leq w$ when $v(s) \leq w(s)$ for all states
$s \in S$. For an event $\mathcal{A}$, we denote by $\Pr^{\pi_1,\pi_2}(\mathcal{A})$ the valuation $S \to [0,1]$ defined for all states $s \in S$
by $(\Pr^{\pi_1,\pi_2}(\mathcal{A}))(s) = \Pr_s^{\pi_1,\pi_2}(\mathcal{A})$. Similarly, for a measurable function $f : \Omega_s \to [0,1]$, we denote by
$\mathrm{E}^{\pi_1,\pi_2}(f)$ the valuation $S \to [0,1]$ defined for all $s \in S$ by $(\mathrm{E}^{\pi_1,\pi_2}(f))(s) = \mathrm{E}_s^{\pi_1,\pi_2}(f)$.

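Reachability and safety objectives. Given a set $F \subseteq S$ of safe states, the objective of a
safety game consists in never leaving $F$. Therefore, we define the set of winning plays as the set
$\mathrm{Safe}(F) = \{\langle s_0, s_1, s_2, \ldots \rangle \in \Omega \mid s_k \in F \text{ for all } k \geq 0\}$. Given a subset $T \subseteq S$ of target states, the
objective of a reachability game consists in reaching $T$. Correspondingly, the set of winning plays is
the set $\mathrm{Reach}(T) = \{\langle s_0, s_1, s_2, \ldots \rangle \in \Omega \mid s_k \in T \text{ for some } k \geq 0\}$ of plays that visit $T$. For all $F \subseteq S$ and
$T \subseteq S$, the sets $\mathrm{Safe}(F)$ and $\mathrm{Reach}(T)$ are measurable. An objective in general is a measurable set,
and in this paper we consider only reachability and safety objectives. For an objective $\Phi$,
the probability of satisfying $\Phi$ from a state $s \in S$ under strategies $\pi_1$ and $\pi_2$ for players 1 and 2,
respectively, is $\Pr_s^{\pi_1,\pi_2}(\Phi)$. We define the value for player 1 of the game with objective $\Phi$ from the state
$s \in S$ as
\[
\langle\langle 1 \rangle\rangle_{\mathit{val}}(\Phi)(s) = \sup_{\pi_1 \in \Pi_1} \inf_{\pi_2 \in \Pi_2} \Pr_s^{\pi_1,\pi_2}(\Phi);
\]
i.e., the value is the maximal probability with which player 1 can guarantee the satisfaction of $\Phi$
against all player 2 strategies. Given a player-1 strategy $\pi_1$, we use the notation
\[
\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\pi_1}(\Phi)(s) = \inf_{\pi_2 \in \Pi_2} \Pr_s^{\pi_1,\pi_2}(\Phi).
\]
A strategy $\pi_1$ for player 1 is optimal for an objective $\Phi$ if for all states $s \in S$, we have
\[
\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\pi_1}(\Phi)(s) = \langle\langle 1 \rangle\rangle_{\mathit{val}}(\Phi)(s).
\]
For $\varepsilon > 0$, a strategy $\pi_1$ for player 1 is $\varepsilon$-optimal if for all states $s \in S$, we have
\[
\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\pi_1}(\Phi)(s) \geq \langle\langle 1 \rangle\rangle_{\mathit{val}}(\Phi)(s) - \varepsilon.
\]
The notions of values and optimal strategies for player 2 are defined analogously. Reachability and
safety objectives are dual, i.e., we have $\mathrm{Reach}(T) = \Omega \setminus \mathrm{Safe}(S \setminus T)$. The quantitative determinacy
result of [18] ensures that for all states $s \in S$, we have
\[
\langle\langle 1 \rangle\rangle_{\mathit{val}}(\mathrm{Safe}(F))(s) + \langle\langle 2 \rangle\rangle_{\mathit{val}}(\mathrm{Reach}(S \setminus F))(s) = 1.
\]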



Theorem 1 (Memoryless determinacy) For all concurrent game graphs $G$, for all $F, T \subseteq S$
such that $F = S \setminus T$, the following assertions hold.

1. [14] Memoryless optimal strategies exist for safety objectives $\mathrm{Safe}(F)$.

2. [4, 13] For all $\varepsilon > 0$, memoryless $\varepsilon$-optimal strategies exist for reachability objectives
$\mathrm{Reach}(T)$.

3. [5] If $G$ is a turn-based stochastic game graph, then pure memoryless optimal strategies exist
for reachability objectives $\mathrm{Reach}(T)$ and safety objectives $\mathrm{Safe}(F)$.

3 Markov Decision Processes 

To develop our arguments, we need some facts about one-player versions of concurrent stochastic 
games, known as Markov decision processes (MDPs) [12, 2]. For $i \in \{1,2\}$, a player-$i$ MDP (for
short, $i$-MDP) is a concurrent game where, for all states $s \in S$, we have $|\Gamma_{3-i}(s)| = 1$. Given a
concurrent game $G$, if we fix a memoryless strategy corresponding to selector $\xi_1$ for player 1, the
game is equivalent to a 2-MDP $G_{\xi_1}$ with the transition function
\[
\delta_{\xi_1}(s,a_2)(t) = \sum_{a_1 \in \Gamma_1(s)} \delta(s,a_1,a_2)(t) \cdot \xi_1(s)(a_1),
\]
for all $s \in S$ and $a_2 \in \Gamma_2(s)$. Similarly, if we fix selectors $\xi_1$ and $\xi_2$ for both players in a concurrent
game $G$, we obtain a Markov chain, which we denote by $G_{\xi_1,\xi_2}$.
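The construction of the 2-MDP transition function from a fixed player-1 selector can be sketched as follows. This is a minimal illustration under an assumed dictionary encoding (`delta[(s, a1, a2)]` maps successors to probabilities, `xi1[s]` maps player-1 moves to probabilities); the function name and the toy game are hypothetical.

```python
from fractions import Fraction

def fix_player1_selector(states, moves1, moves2, delta, xi1):
    """Return d with d[(s, a2)][t] = sum over a1 of delta(s,a1,a2)(t) * xi1(s)(a1)."""
    mdp = {}
    for s in states:
        for a2 in moves2[s]:
            dist = {}
            for a1 in moves1[s]:
                weight = xi1[s].get(a1, 0)
                if weight == 0:
                    continue  # moves outside the support contribute nothing
                for t, p in delta[(s, a1, a2)].items():
                    dist[t] = dist.get(t, 0) + p * weight
            mdp[(s, a2)] = dist
    return mdp

# Toy game invented for illustration: "t" is absorbing.
STATES = ["s", "t"]
MOVES1 = {"s": ["a", "b"], "t": ["a"]}
MOVES2 = {"s": ["a", "b"], "t": ["a"]}
DELTA = {
    ("s", "a", "a"): {"s": Fraction(1)},
    ("s", "b", "b"): {"s": Fraction(1)},
    ("s", "a", "b"): {"t": Fraction(1)},
    ("s", "b", "a"): {"s": Fraction(1, 2), "t": Fraction(1, 2)},
    ("t", "a", "a"): {"t": Fraction(1)},
}
# Player 1 plays "a" and "b" with probability 1/2 each at "s".
XI1 = {"s": {"a": Fraction(1, 2), "b": Fraction(1, 2)}, "t": {"a": Fraction(1)}}
MDP = fix_player1_selector(STATES, MOVES1, MOVES2, DELTA, XI1)
```

In this toy instance, when player 2 plays "a" at "s" the resulting distribution puts 3/4 on "s" and 1/4 on "t".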

End components. In an MDP, the sets of states that play an equivalent role to the closed 
recurrent classes of Markov chains [16] are called "end components" [7, 8]. 

Definition 3 (End components) An end component of an $i$-MDP $G$, for $i \in \{1,2\}$, is a subset
$C \subseteq S$ of the states such that there is a selector $\xi$ for player $i$ so that $C$ is a closed recurrent class
of the Markov chain $G_\xi$.

It is not difficult to see that an equivalent characterization of an end component $C$ is the following.
For each state $s \in C$, there is a subset $M_i(s) \subseteq \Gamma_i(s)$ of moves such that:

1. (closed) if a move in $M_i(s)$ is chosen by player $i$ at state $s$, then all successor states that are
obtained with nonzero probability lie in $C$; and

2. (recurrent) the graph $(C,E)$, where $E$ consists of the transitions that occur with nonzero
probability when moves in $M_i(\cdot)$ are chosen by player $i$, is strongly connected.
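The two conditions above translate directly into a check on a candidate set of states and a candidate choice of moves. The sketch below assumes an MDP encoded as `mdp[(s, a)] = {successor: probability}`; the function name, the encoding, and the example are illustrative assumptions rather than part of the paper.

```python
def is_end_component(component, chosen_moves, mdp):
    """Check the closed and recurrent conditions for `component` under `chosen_moves`."""
    component = set(component)
    if not component:
        return False
    edges = {s: set() for s in component}
    for s in component:
        if not chosen_moves.get(s):
            return False  # each state needs a nonempty set of chosen moves
        for a in chosen_moves[s]:
            for t, p in mdp[(s, a)].items():
                if p > 0:
                    if t not in component:
                        return False  # violates the closed condition
                    edges[s].add(t)
    reverse = {s: set() for s in component}
    for s, succs in edges.items():
        for t in succs:
            reverse[t].add(s)

    def reachable(start, graph):
        seen, stack = {start}, [start]
        while stack:
            for nxt in graph[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # Strongly connected: every state is reachable from, and can reach, the root.
    root = next(iter(component))
    return reachable(root, edges) == component and reachable(root, reverse) == component

# Toy MDP invented for illustration.
MDP = {
    ("x", "go"): {"y": 1.0},
    ("y", "back"): {"x": 1.0},
    ("y", "leave"): {"z": 1.0},
    ("z", "stay"): {"z": 1.0},
}
```

For instance, `{"x", "y"}` passes the check with the moves `go` and `back`, but fails once `leave` is also allowed at `y`, because the set is then no longer closed.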

Given a play $\omega \in \Omega$, we denote by $\mathrm{Inf}(\omega)$ the set of states that occur infinitely often along $\omega$.
Given a set $\mathcal{F} \subseteq 2^S$ of subsets of states, we denote by $\mathrm{Inf}(\mathcal{F})$ the event $\{\omega \mid \mathrm{Inf}(\omega) \in \mathcal{F}\}$. The
following theorem states that in a 2-MDP, for every strategy of player 2, the set of states that are
visited infinitely often is, with probability 1, an end component. Corollary 1 follows easily from
Theorem 2.

Theorem 2 [8] For a player-1 selector $\xi_1$, let $\mathcal{C}$ be the set of end components of a 2-MDP $G_{\xi_1}$.
For all player-2 strategies $\pi_2$ and all states $s \in S$, we have $\Pr_s^{\overline{\xi}_1,\pi_2}(\mathrm{Inf}(\mathcal{C})) = 1$.

Corollary 1 For a player-1 selector $\xi_1$, let $\mathcal{C}$ be the set of end components of a 2-MDP $G_{\xi_1}$, and
let $Z = \bigcup_{C \in \mathcal{C}} C$ be the set of states of all end components. For all player-2 strategies $\pi_2$ and all
states $s \in S$, we have $\Pr_s^{\overline{\xi}_1,\pi_2}(\mathrm{Reach}(Z)) = 1$.

MDPs with reachability objectives. Given a 2-MDP with a reachability objective $\mathrm{Reach}(T)$
for player 2, where $T \subseteq S$, the values can be obtained as the solution of a linear program [14].
The linear program has a variable $x(s)$ for all states $s \in S$, and the objective function and the
constraints are as follows:
\[
\min \sum_{s \in S} x(s) \quad \text{subject to}
\]
\[
\begin{array}{ll}
x(s) \geq \sum_{t \in S} x(t) \cdot \delta(s,a_2)(t) & \text{for all } s \in S \text{ and } a_2 \in \Gamma_2(s) \\
x(s) = 1 & \text{for all } s \in T \\
0 \leq x(s) \leq 1 & \text{for all } s \in S
\end{array}
\]

The correctness of the above linear program to compute the values follows from [12, 14]. 
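As a rough numerical companion to the linear program, the sketch below approximates the same reachability values by value iteration instead of solving the LP. This is a deliberately swapped-in technique: iterating from the all-zero valuation outside $T$ approaches the least solution from below and in general only converges in the limit. The encoding `mdp[(s, a2)] = {successor: probability}`, the function name, and the `rounds` parameter are assumptions made for illustration.

```python
def reach_values_by_iteration(states, moves2, mdp, target, rounds=1000):
    """Approximate player 2's maximal probability of reaching `target` in a 2-MDP."""
    x = {s: (1.0 if s in target else 0.0) for s in states}
    for _ in range(rounds):
        # One Bellman update: target states stay at 1, others take the best move.
        x = {
            s: 1.0 if s in target else max(
                sum(p * x[t] for t, p in mdp[(s, a2)].items())
                for a2 in moves2[s]
            )
            for s in states
        }
    return x

# Toy 2-MDP invented for illustration: "t" is the target, "u" is a sink.
STATES = ["s", "t", "u"]
MOVES2 = {"s": ["a", "b"], "t": ["a"], "u": ["a"]}
MDP = {
    ("s", "a"): {"s": 0.75, "t": 0.25},   # retry until the target is hit
    ("s", "b"): {"t": 0.5, "u": 0.5},     # gamble once
    ("t", "a"): {"t": 1.0},
    ("u", "a"): {"u": 1.0},
}
VALUES = reach_values_by_iteration(STATES, MOVES2, MDP, {"t"})
```

Here move "a" at "s" eventually reaches the target with probability 1, so the computed value at "s" approaches 1, while the sink "u" keeps value 0. An exact answer would require solving the linear program itself with an LP solver.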

4 Strategy Improvement for Safety Games 

In this section we present a strategy improvement algorithm for concurrent games with safety
objectives. The algorithm will produce a sequence of selectors $\gamma_0, \gamma_1, \gamma_2, \ldots$ for player 1, such that:

1. for all $i \geq 0$, we have $\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_i}(\mathrm{Safe}(F)) \leq \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_{i+1}}(\mathrm{Safe}(F))$;

2. if there is $i \geq 0$ such that $\gamma_i = \gamma_{i+1}$, then $\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_i}(\mathrm{Safe}(F)) = \langle\langle 1 \rangle\rangle_{\mathit{val}}(\mathrm{Safe}(F))$; and

3. $\lim_{i \to \infty} \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_i}(\mathrm{Safe}(F)) = \langle\langle 1 \rangle\rangle_{\mathit{val}}(\mathrm{Safe}(F))$.

Condition 1 guarantees that the algorithm computes a sequence of monotonically improving
selectors. Condition 2 guarantees that if a selector cannot be improved, then it is optimal. Condition 3
guarantees that the value guaranteed by the selectors converges to the value of the game, or
equivalently, that for all $\varepsilon > 0$, there is a number $i$ of iterations such that the memoryless player-1
strategy $\overline{\gamma}_i$ is $\varepsilon$-optimal. Note that for concurrent safety games, there may be no $i \geq 0$ such that
$\gamma_i = \gamma_{i+1}$, that is, the algorithm may fail to generate an optimal selector. This is because there are
concurrent safety games such that the values are irrational [11]. We start with a few notations.

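The Pre operator and optimal selectors. Given a valuation $v$, and two selectors $\xi_1 \in \Lambda_1$
and $\xi_2 \in \Lambda_2$, we define the valuations $\mathit{Pre}_{\xi_1,\xi_2}(v)$, $\mathit{Pre}_{1:\xi_1}(v)$, and $\mathit{Pre}_1(v)$ as follows, for all states
$s \in S$:
\[
\begin{array}{rcl}
\mathit{Pre}_{\xi_1,\xi_2}(v)(s) & = & \sum_{a,b \in M} \sum_{t \in S} v(t) \cdot \delta(s,a,b)(t) \cdot \xi_1(s)(a) \cdot \xi_2(s)(b) \\
\mathit{Pre}_{1:\xi_1}(v)(s) & = & \inf_{\xi_2 \in \Lambda_2} \mathit{Pre}_{\xi_1,\xi_2}(v)(s) \\
\mathit{Pre}_1(v)(s) & = & \sup_{\xi_1 \in \Lambda_1} \inf_{\xi_2 \in \Lambda_2} \mathit{Pre}_{\xi_1,\xi_2}(v)(s)
\end{array}
\]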


Intuitively, $\mathit{Pre}_1(v)(s)$ is the greatest expectation of $v$ that player 1 can guarantee at a successor
state of $s$. Also note that given a valuation $v$, the computation of $\mathit{Pre}_1(v)$ reduces to the solution of a
zero-sum one-shot matrix game, and can be solved by linear programming. Similarly, $\mathit{Pre}_{1:\xi_1}(v)(s)$
is the greatest expectation of $v$ that player 1 can guarantee at a successor state of $s$ by playing
the selector $\xi_1$. Note that all of these operators on valuations are monotonic: for two valuations
$v, w$, if $v \leq w$, then for all selectors $\xi_1 \in \Lambda_1$ and $\xi_2 \in \Lambda_2$, we have $\mathit{Pre}_{\xi_1,\xi_2}(v) \leq \mathit{Pre}_{\xi_1,\xi_2}(w)$,
$\mathit{Pre}_{1:\xi_1}(v) \leq \mathit{Pre}_{1:\xi_1}(w)$, and $\mathit{Pre}_1(v) \leq \mathit{Pre}_1(w)$. Given a valuation $v$ and a state $s$, we define by
\[
\mathrm{OptSel}(v,s) = \{\xi_1 \in \Lambda_1(s) \mid \mathit{Pre}_{1:\xi_1}(v)(s) = \mathit{Pre}_1(v)(s)\}
\]
the set of optimal selectors for $v$ at state $s$. For an optimal selector $\xi_1 \in \mathrm{OptSel}(v,s)$, we define the
set of counter-optimal actions as follows:
\[
\mathrm{CountOpt}(v,s,\xi_1) = \{b \in \Gamma_2(s) \mid \mathit{Pre}_{\xi_1,b}(v)(s) = \mathit{Pre}_1(v)(s)\}.
\]
Observe that for $\xi_1 \in \mathrm{OptSel}(v,s)$, for all $b \in \Gamma_2(s) \setminus \mathrm{CountOpt}(v,s,\xi_1)$ we have
$\mathit{Pre}_{\xi_1,b}(v)(s) > \mathit{Pre}_1(v)(s)$. We define the set of optimal selector support and the counter-optimal action set as
follows:
\[
\begin{array}{rl}
\mathrm{OptSelCount}(v,s) = \{(A,B) \subseteq \Gamma_1(s) \times \Gamma_2(s) \mid & \exists \xi_1 \in \Lambda_1(s).\ \xi_1 \in \mathrm{OptSel}(v,s) \\
 & \wedge\ \mathrm{Supp}(\xi_1) = A \ \wedge\ \mathrm{CountOpt}(v,s,\xi_1) = B\};
\end{array}
\]
i.e., it consists of pairs $(A,B)$ of actions of player 1 and player 2, such that there is an optimal
selector $\xi_1$ with support $A$, and $B$ is the set of counter-optimal actions to $\xi_1$.
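For a selector that is already known to be optimal, the counter-optimal actions are exactly the player-2 moves that attain the minimum of $\mathit{Pre}_{\xi_1,b}(v)(s)$, because that minimum then equals $\mathit{Pre}_1(v)(s)$. The sketch below uses this observation. It assumes, without checking, that the selector passed in belongs to $\mathrm{OptSel}(v,s)$; the encoding and names are hypothetical, and exact rationals are used so that equality tests are meaningful.

```python
from fractions import Fraction

def count_opt(v, s, delta, xi1_s, moves2_s):
    """CountOpt(v, s, xi1) for a selector xi1_s ASSUMED to be optimal at s."""
    payoff = {
        b: sum(
            v[t] * p * pa
            for a, pa in xi1_s.items()
            for t, p in delta[(s, a, b)].items()
        )
        for b in moves2_s
    }
    best = min(payoff.values())  # equals Pre_1(v)(s) when xi1_s is optimal
    return {b for b, value in payoff.items() if value == best}

# Toy state invented for illustration: matching moves win, and player 2 has an
# extra move "x" that always hands player 1 the win.
DELTA = {
    ("s", "h", "h"): {"win": Fraction(1)},
    ("s", "t", "t"): {"win": Fraction(1)},
    ("s", "h", "t"): {"lose": Fraction(1)},
    ("s", "t", "h"): {"lose": Fraction(1)},
    ("s", "h", "x"): {"win": Fraction(1)},
    ("s", "t", "x"): {"win": Fraction(1)},
}
V = {"win": Fraction(1), "lose": Fraction(0)}
UNIFORM = {"h": Fraction(1, 2), "t": Fraction(1, 2)}
```

In this example the uniform selector is optimal with value 1/2, the moves `h` and `t` are counter-optimal, and `x` is not, since it yields 1.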

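Turn-based reduction. Given a concurrent game $G = (S, M, \Gamma_1, \Gamma_2, \delta)$ and a valuation $v$ we
construct a turn-based stochastic game $\overline{G}_v = ((\overline{S}, \overline{E}), (\overline{S}_1, \overline{S}_2, \overline{S}_R), \overline{\delta})$ as follows:

1. The set of states is as follows:
\[
\begin{array}{rl}
\overline{S} = & S \cup \{(s,A,B) \mid s \in S,\ (A,B) \in \mathrm{OptSelCount}(v,s)\} \\
 & \cup\ \{(s,A,b) \mid s \in S,\ (A,B) \in \mathrm{OptSelCount}(v,s),\ b \in B\}.
\end{array}
\]

2. The state space partition is as follows: $\overline{S}_1 = S$; $\overline{S}_2 = \{(s,A,B) \mid s \in S,\ (A,B) \in
\mathrm{OptSelCount}(v,s)\}$; and $\overline{S}_R = \overline{S} \setminus (\overline{S}_1 \cup \overline{S}_2)$.

3. The set of edges is as follows:
\[
\begin{array}{rl}
\overline{E} = & \{(s,(s,A,B)) \mid s \in S,\ (A,B) \in \mathrm{OptSelCount}(v,s)\} \\
 & \cup\ \{((s,A,B),(s,A,b)) \mid b \in B\} \ \cup\ \{((s,A,b),t) \mid t \in \bigcup_{a \in A} \mathrm{Dest}(s,a,b)\}.
\end{array}
\]

4. The transition function $\overline{\delta}$ for all states in $\overline{S}_R$ is uniform over its successors.

Intuitively, the reduction is as follows. Given the valuation $v$, state $s$ is a player 1 state where
player 1 can select a pair $(A,B)$ (and move to state $(s,A,B)$) with $A \subseteq \Gamma_1(s)$ and $B \subseteq \Gamma_2(s)$
such that there is an optimal selector $\xi_1$ with support exactly $A$ and the set of counter-optimal
actions to $\xi_1$ is the set $B$. From a player 2 state $(s,A,B)$, player 2 can choose any action $b$
from the set $B$, and move to state $(s,A,b)$. A state $(s,A,b)$ is a probabilistic state where all the
states in $\bigcup_{a \in A} \mathrm{Dest}(s,a,b)$ are chosen uniformly at random. Given a set $F \subseteq S$ we denote by
$\overline{F} = F \cup \{(s,A,B) \in \overline{S} \mid s \in F\} \cup \{(s,A,b) \in \overline{S} \mid s \in F\}$. We refer to the above reduction as TB,
i.e., $(\overline{G}_v, \overline{F}) = \mathrm{TB}(G, v, F)$.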
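The graph part of the reduction can be sketched as below. The sketch takes the sets $\mathrm{OptSelCount}(v,s)$ as a precomputed input (computing them involves solving matrix games and is outside its scope) and assumes a dictionary encoding of $\delta$. All names, the tuple encoding of the new states, and the toy instance are assumptions made for illustration.

```python
from fractions import Fraction

def turn_based_reduction(states, delta, opt_sel_count):
    """Build the state partition, edges, and uniform transition function.

    opt_sel_count[s] is an iterable of (A, B) pairs of frozensets, assumed given.
    """
    player1 = set(states)
    player2, random_states, edges = set(), set(), set()
    for s in states:
        for A, B in opt_sel_count[s]:
            choice = (s, A, B)            # player-2 state
            player2.add(choice)
            edges.add((s, choice))
            for b in B:
                chance = (s, A, b)        # probabilistic state
                random_states.add(chance)
                edges.add((choice, chance))
                for a in A:
                    for t, p in delta[(s, a, b)].items():
                        if p > 0:         # t is in Dest(s, a, b)
                            edges.add((chance, t))
    successors = {u: set() for u in random_states}
    for u, t in edges:
        if u in random_states:
            successors[u].add(t)
    delta_bar = {
        u: {t: Fraction(1, len(ts)) for t in ts} for u, ts in successors.items()
    }
    return player1, player2, random_states, edges, delta_bar

# Toy instance invented for illustration; OSC is supplied by hand.
HT, ONE = frozenset({"h", "t"}), frozenset({"a"})
DELTA = {
    ("s", "h", "h"): {"win": 1}, ("s", "t", "t"): {"win": 1},
    ("s", "h", "t"): {"lose": 1}, ("s", "t", "h"): {"lose": 1},
    ("win", "a", "a"): {"win": 1}, ("lose", "a", "a"): {"lose": 1},
}
OSC = {"s": [(HT, HT)], "win": [(ONE, ONE)], "lose": [(ONE, ONE)]}
P1, P2, PR, EDGES, DELTA_BAR = turn_based_reduction(["s", "win", "lose"], DELTA, OSC)
```

Note that the probabilistic states forget the actual transition probabilities of $\delta$ and keep only the supports, which is what item 4 of the construction prescribes.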

Value-class of a valuation. Given a valuation $v$ and a real $0 \leq r \leq 1$, the value-class $U_r(v)$ of
value $r$ is the set of states with valuation $r$, i.e., $U_r(v) = \{s \in S \mid v(s) = r\}$.




Figure 1: A turn-based stochastic safety game. 

4.1 The strategy improvement algorithm 

Ordering of strategies. Let $G$ be a concurrent game and $F$ be the set of safe states. Let
$T = S \setminus F$. Given a concurrent game graph $G$ with a safety objective $\mathrm{Safe}(F)$, the set of
almost-sure winning states is the set of states $s$ such that the value at $s$ is 1, i.e.,
$W_1 = \{s \in S \mid \langle\langle 1 \rangle\rangle_{\mathit{val}}(\mathrm{Safe}(F))(s) = 1\}$. An optimal strategy from $W_1$ is
referred to as an almost-sure winning strategy. The set $W_1$ and an almost-sure winning strategy can
be computed in linear time by the algorithm given in [9]. We assume without loss of generality
that all states in $W_1 \cup T$ are absorbing. We define a preorder $\prec$ on the strategies for player 1 as
follows: given two player 1 strategies $\pi_1$ and $\pi_1'$, let $\pi_1 \prec \pi_1'$ if the following two conditions hold:
(i) $\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\pi_1}(\mathrm{Safe}(F)) \leq \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\pi_1'}(\mathrm{Safe}(F))$; and
(ii) $\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\pi_1}(\mathrm{Safe}(F))(s) < \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\pi_1'}(\mathrm{Safe}(F))(s)$ for some
state $s \in S$. Furthermore, we write $\pi_1 \preceq \pi_1'$ if either $\pi_1 \prec \pi_1'$ or $\pi_1 = \pi_1'$. We first present an
example that shows that improvements based only on $\mathit{Pre}_1$ operators are not sufficient for safety
games, even on turn-based games, and then present our algorithm.

Example 1 Consider the turn-based stochastic game shown in Fig. 1, where the □ states are
player 1 states, the ◇ states are player 2 states, and the ○ states are random states with
probabilities labeled on edges. The safety goal is to avoid the state $s_4$. Consider a memoryless strategy $\pi_1$
for player 1 that chooses the successor $s_0 \to s_2$, and the counter-strategy $\pi_2$ for player 2 that chooses
$s_1 \to s_0$. Given the strategies $\pi_1$ and $\pi_2$, the value at $s_0$, $s_1$ and $s_2$ is 1/3, and since all successors
of $s_0$ have value 1/3, the value cannot be improved by $\mathit{Pre}_1$. However, note that if player 2 is
restricted to choose only value optimal selectors for the value 1/3, then player 1 can switch to the
strategy $s_0 \to s_1$ and ensure that the game stays in the value class 1/3 with probability 1. Hence
switching to $s_0 \to s_1$ would force player 2 to select a counter-strategy that switches to the strategy
$s_1 \to s_3$, and thus player 1 can get a value 2/3. ∎

Informal description of Algorithm 1. We now present the strategy improvement algorithm
(Algorithm 1) for computing the values for all states in $S \setminus W_1$. The algorithm iteratively improves
player-1 strategies according to the preorder $\prec$. The algorithm starts with the random selector
$\gamma_0 = \xi_1^{\mathrm{unif}}$ that plays at all states all actions uniformly at random. At iteration $i+1$, the algorithm
considers the memoryless player-1 strategy $\overline{\gamma}_i$ and computes the value $\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_i}(\mathrm{Safe}(F))$. Observe
that since $\overline{\gamma}_i$ is a memoryless strategy, the computation of $\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_i}(\mathrm{Safe}(F))$ involves solving the
2-MDP $G_{\gamma_i}$. The valuation $\langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_i}(\mathrm{Safe}(F))$ is named $v_i$. For all states $s$ such that $\mathit{Pre}_1(v_i)(s) > v_i(s)$,
the memoryless strategy at $s$ is modified to a selector that is value-optimal for $v_i$. The algorithm
then proceeds to the next iteration. If $\mathit{Pre}_1(v_i) = v_i$, then the algorithm constructs the game
$(\overline{G}_{v_i}, \overline{F}) = \mathrm{TB}(G, v_i, F)$, and computes $\overline{A}_i$ as the set of almost-sure winning states in $\overline{G}_{v_i}$ for the
objective $\mathrm{Safe}(\overline{F})$. Let $U = (\overline{A}_i \cap S) \setminus W_1$. If $U$ is non-empty, then a selector $\gamma_{i+1}$ is obtained at $U$



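Algorithm 1 Safety Strategy-Improvement Algorithm

Input: a concurrent game structure $G$ with safe set $F$.
Output: a strategy $\overline{\gamma}$ for player 1.

0. Compute $W_1 = \{s \in S \mid \langle\langle 1 \rangle\rangle_{\mathit{val}}(\mathrm{Safe}(F))(s) = 1\}$.

1. Let $\gamma_0 = \xi_1^{\mathrm{unif}}$ and $i = 0$.

2. Compute $v_0 = \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_0}(\mathrm{Safe}(F))$.

3. do {

   3.1. Let $I = \{s \in S \setminus (W_1 \cup T) \mid \mathit{Pre}_1(v_i)(s) > v_i(s)\}$.

   3.2. if $I \neq \emptyset$, then

      3.2.1. Let $\xi_1$ be a player-1 selector such that for all states $s \in I$,
             we have $\mathit{Pre}_{1:\xi_1}(v_i)(s) = \mathit{Pre}_1(v_i)(s) > v_i(s)$.

      3.2.2. The player-1 selector $\gamma_{i+1}$ is defined as follows: for each state $s \in S$, let
             $\gamma_{i+1}(s) = \gamma_i(s)$ if $s \notin I$;
             $\gamma_{i+1}(s) = \xi_1(s)$ if $s \in I$.

   3.3. else

      3.3.1. let $(\overline{G}_{v_i}, \overline{F}) = \mathrm{TB}(G, v_i, F)$

      3.3.2. let $\overline{A}_i$ be the set of almost-sure winning states in $\overline{G}_{v_i}$ for $\mathrm{Safe}(\overline{F})$ and
             $\overline{\pi}_1$ be a pure memoryless almost-sure winning strategy from the set $\overline{A}_i$.

      3.3.3. if $((\overline{A}_i \cap S) \setminus W_1 \neq \emptyset)$

         3.3.3.1. let $U = (\overline{A}_i \cap S) \setminus W_1$

         3.3.3.2. The player-1 selector $\gamma_{i+1}$ is defined as follows: for $s \in S$, let
                  $\gamma_{i+1}(s) = \gamma_i(s)$ if $s \notin U$;
                  $\gamma_{i+1}(s) = \xi_1(s)$ if $s \in U$, where $\xi_1(s) \in \mathrm{OptSel}(v_i, s)$,
                  $\overline{\pi}_1(s) = (s, A, B)$, and $B = \mathrm{OptSelCount}(s, v, \xi_1)$.

   3.4. Compute $v_{i+1} = \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_{i+1}}(\mathrm{Safe}(F))$.

   3.5. Let $i = i + 1$.

   } until $I = \emptyset$ and $(\overline{A}_{i-1} \cap S) \setminus W_1 = \emptyset$.

4. return $\overline{\gamma}_i$.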

from a pure memoryless optimal strategy (i.e., an almost-sure winning strategy) in $\overline{G}_{v_i}$, and the
algorithm proceeds to iteration $i + 1$. If $\mathit{Pre}_1(v_i) = v_i$ and $U$ is empty, then the algorithm stops
and returns the memoryless strategy $\overline{\gamma}_i$ for player 1. Unlike strategy improvement algorithms for
turn-based games (see [6] for a survey), Algorithm 1 is not guaranteed to terminate, because the
value of a safety game may not be rational.
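The control flow of Algorithm 1 can be summarized by the following skeleton. It is only a sketch of the loop structure: the three callbacks stand for the evaluation of a selector (steps 2 and 3.4), the local $\mathit{Pre}_1$-based improvement (steps 3.1 and 3.2), and the non-local improvement via the turn-based reduction (step 3.3), and they are hypothetical placeholders supplied by the caller. The `max_iters` bound is an added assumption, needed because the algorithm itself may not terminate.

```python
def strategy_improvement(gamma0, evaluate, local_improve, nonlocal_improve,
                         max_iters=100):
    """Skeleton of the improvement loop with caller-supplied (hypothetical) steps.

    evaluate(gamma)            -> valuation v for the selector gamma
    local_improve(gamma, v)    -> {state: new selector} for states in I (may be empty)
    nonlocal_improve(gamma, v) -> {state: new selector} for states in U (may be empty)
    """
    gamma = dict(gamma0)
    v = evaluate(gamma)
    history = [v]
    for _ in range(max_iters):
        changes = local_improve(gamma, v)
        if not changes:                      # I is empty: try the non-local step
            changes = nonlocal_improve(gamma, v)
        if not changes:                      # I and U are both empty: stop
            break
        gamma.update(changes)
        v = evaluate(gamma)
        history.append(v)
    return gamma, history

# Stub callbacks that only exercise the control flow; they do NOT solve a game.
def _evaluate(gamma):
    return {"s": gamma["s"] / 3}

def _local(gamma, v):
    return {"s": gamma["s"] + 1} if gamma["s"] < 2 else {}

def _nonlocal(gamma, v):
    return {"s": 3} if gamma["s"] == 2 else {}

GAMMA, HISTORY = strategy_improvement({"s": 0}, _evaluate, _local, _nonlocal)
```

In the stub run, two local improvements are followed by one non-local improvement, after which neither step applies and the loop stops. The recorded valuations are non-decreasing, mirroring the monotonicity that the algorithm is designed to provide.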

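Lemma 1 Let $\gamma_i$ and $\gamma_{i+1}$ be the player-1 selectors obtained at iterations $i$ and $i+1$ of
Algorithm 1. Let $I = \{s \in S \setminus (W_1 \cup T) \mid \mathit{Pre}_1(v_i)(s) > v_i(s)\}$. Let $v_i = \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_i}(\mathrm{Safe}(F))$
and $v_{i+1} = \langle\langle 1 \rangle\rangle_{\mathit{val}}^{\overline{\gamma}_{i+1}}(\mathrm{Safe}(F))$. Then $v_{i+1}(s) \geq \mathit{Pre}_1(v_i)(s)$ for all states $s \in S$; and therefore
$v_{i+1}(s) \geq v_i(s)$ for all states $s \in S$, and $v_{i+1}(s) > v_i(s)$ for all states $s \in I$.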

Proof. Consider the valuations Vi and Wj+i obtained at iterations i and i-1- 1, respectively, and let 
Wi be the valuation defined by Wi{s) = 1 — Vi{s) for all states s £ S. The counter-optimal strategy 
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for player 2 to minimize f j+i is obtained by maximizing the probability to reach T. Let 

\Wi{s) if s e s\r, 

Wi-i-l(s) = < 

^^ [1 - Prei{vi){s) < Wi{s) if s € /. 

In other words, tfj+i = 1 — Prei{vi), and we also have tfj+i < Wi. We now show that Wj+i is 
a feasible solution to the linear program for MDPs with the objective Reach(r), as described in 
Section 3. Since Vi = ((l))Jg|(Safe(-F)), it follows that for all states s ^ S and all moves 02 € r2(s), 
we have 

Wi{s) > '^Wi{t) ■6-y^{s,a2). 
tes 
For all states s € 5'\/, we have 7j(s) = 7j-|_i(s) and ttjj+i(s) = Wi{s), and since tfj+i < Wi, it follows 
that for all states s € S" \ / and all moves 02 € T2{s), we have 

Wi+i{s) = Wi{s) > '^Wi+i{t) ■6^^^-^{s,a2) ( for s G S\I). 
tes 

Since for $s \in I$ the selector $\gamma_{i+1}(s)$ is obtained as an optimal selector for $Pre_1(v_i)(s)$, it follows
that for all states $s \in I$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$Pre_{\gamma_{i+1}, a_2}(v_i)(s) \ge Pre_1(v_i)(s);$$

in other words, $1 - Pre_1(v_i)(s) \ge 1 - Pre_{\gamma_{i+1}, a_2}(v_i)(s)$. Hence for all states $s \in I$ and all moves
$a_2 \in \Gamma_2(s)$, we have

$$w_{i+1}(s) \ge \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_{i+1}}(s, a_2)(t).$$

Since $w_{i+1} \le w_i$, for all states $s \in I$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$w_{i+1}(s) \ge \sum_{t \in S} w_{i+1}(t) \cdot \delta_{\gamma_{i+1}}(s, a_2)(t) \qquad (\text{for } s \in I).$$

Hence it follows that $w_{i+1}$ is a feasible solution to the linear program for MDPs with reachability
objectives. Since the reachability valuation for player 2 for $\mathrm{Reach}(T)$ is the least solution (observe
that the objective function of the linear program is a minimizing function), it follows that
$v_{i+1} \ge 1 - w_{i+1} = Pre_1(v_i)$. Thus we obtain $v_{i+1}(s) \ge v_i(s)$ for all states $s \in S$, and
$v_{i+1}(s) > v_i(s)$ for all states $s \in I$. ∎
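The proof relies on a standard fact: any non-negative valuation that satisfies the inequalities of the reachability linear program is an upper bound on the maximal reachability probability, which is the least such solution. The following is a minimal Python sketch of that fact on a toy MDP. The dictionary encoding (`mdp[s][a]` maps successors to probabilities) and the names `is_feasible` and `max_reach` are illustrative assumptions, not notation from the paper, and value iteration is used here only as a convenient way to approximate the least solution from below.

```python
from fractions import Fraction

def is_feasible(w, mdp, target):
    """Check the reachability LP constraints: w = 1 on target, and elsewhere
    w(s) >= 0 and w(s) >= sum_t w(t) * delta(s, a)(t) for every action a."""
    for s, actions in mdp.items():
        if s in target:
            if w[s] != 1:
                return False
        elif w[s] < 0 or any(
            w[s] < sum(p * w[t] for t, p in dist.items())
            for dist in actions.values()
        ):
            return False
    return True

def max_reach(mdp, target, rounds=100):
    """Approximate the maximal reachability probability from below."""
    u = {s: Fraction(int(s in target)) for s in mdp}
    for _ in range(rounds):
        u = {
            s: u[s] if s in target else max(
                sum(p * u[t] for t, p in dist.items())
                for dist in mdp[s].values()
            )
            for s in mdp
        }
    return u

# Toy MDP (hypothetical): "t" is the target, "d" is a dead end.
half = Fraction(1, 2)
mdp = {
    "s": {"a": {"t": half, "d": half}, "b": {"s": half, "d": half}},
    "t": {"stay": {"t": Fraction(1)}},
    "d": {"stay": {"d": Fraction(1)}},
}
target = {"t"}
w = {"s": Fraction(3, 4), "t": Fraction(1), "d": Fraction(0)}  # a feasible point
u = max_reach(mdp, target)
```

In this toy instance `w` is feasible but not tight, and the iterates `u` stay below it at every state, which mirrors the step $v_{i+1} \ge 1 - w_{i+1}$ in the proof.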

Recall that, by Example 1, improvement by step 3.2 alone is not sufficient to
guarantee convergence to optimal values. We now present a lemma about the turn-based reduction,
and then show that step 3.3 also leads to an improvement. Finally, in Theorem 4 we show that
if improvements by step 3.2 and step 3.3 are not possible, then the optimal value and an optimal
strategy are obtained.

Lemma 2 Let $G$ be a concurrent game with a set $F$ of safe states. Let $v$ be a valuation and
consider $(G_v, F) = \mathrm{TB}(G, v, F)$. Let $A$ be the set of almost-sure winning states in $G_v$ for the objective
$\mathrm{Safe}(F)$, and let $\overline{\pi}_1$ be a pure memoryless almost-sure winning strategy from $A$ in $G_v$. Consider
a memoryless strategy $\pi_1$ in $G$ for states in $A \cap S$ as follows: if $\overline{\pi}_1(s) = (s, A, B)$, then
$\pi_1(s) \in \mathrm{OptSel}(v, s)$ such that $\mathrm{Supp}(\pi_1(s)) = A$ and $\mathrm{OptSelCount}(v, s, \pi_1(s)) = B$. Consider a pure
memoryless strategy $\pi_2$ for player 2. If for all states $s \in A \cap S$, we have $\pi_2(s) \in \mathrm{OptSelCount}(v, s, \pi_1(s))$,
then for all $s \in A \cap S$, we have $\Pr_s^{\pi_1, \pi_2}(\mathrm{Safe}(F)) = 1$.


Proof. We analyze the Markov chain arising after the players fix the memoryless strategies
$\pi_1$ and $\pi_2$. Given the strategy $\pi_2$ consider the strategy $\overline{\pi}_2$ as follows: if $\overline{\pi}_1(s) = (s, A, B)$ and
$\pi_2(s) = b \in \mathrm{OptSelCount}(v, s, \pi_1(s))$, then at state $(s, A, B)$ choose the successor $(s, A, b)$. Since
$\overline{\pi}_1$ is an almost-sure winning strategy for $\mathrm{Safe}(F)$, it follows that in the Markov chain obtained
by fixing $\overline{\pi}_1$ and $\overline{\pi}_2$ in $G_v$, all closed connected recurrent sets of states that intersect with $A$ are
contained in $A$, and from all states of $A$ the closed connected recurrent sets of states within $A$ are
reached with probability 1. It follows that in the Markov chain obtained from fixing $\pi_1$ and $\pi_2$ in
$G$ all closed connected recurrent sets of states that intersect with $A \cap S$ are contained in $A \cap S$, and
from all states of $A \cap S$ the closed connected recurrent sets of states within $A \cap S$ are reached with
probability 1. The desired result follows. ∎

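Lemma 3 Let $\gamma_i$ and $\gamma_{i+1}$ be the player-1 selectors obtained at iterations $i$ and $i + 1$ of Algorithm 1.
Let $I = \{s \in S \setminus (W_1 \cup T) \mid Pre_1(v_i)(s) > v_i(s)\} = \emptyset$, and $(A_i \cap S) \setminus W_1 \neq \emptyset$.
Let $v_i = \langle\langle 1 \rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Safe}(F))$
and $v_{i+1} = \langle\langle 1 \rangle\rangle_{val}^{\overline{\gamma}_{i+1}}(\mathrm{Safe}(F))$. Then $v_{i+1}(s) \ge v_i(s)$ for all states $s \in S$, and $v_{i+1}(s) > v_i(s)$ for
some state $s \in (A_i \cap S) \setminus W_1$.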

Proof. We first show that $v_{i+1} \ge v_i$. Let $U = (A_i \cap S) \setminus W_1$. Let $w_i(s) = 1 - v_i(s)$ for all states
$s \in S$. Since $v_i = \langle\langle 1 \rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Safe}(F))$, it follows that for all states $s \in S$ and all moves $a_2 \in \Gamma_2(s)$, we
have

$$w_i(s) \ge \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_i}(s, a_2)(t).$$

The selector $\xi_1(s)$ chosen for $\gamma_{i+1}$ at $s \in U$ satisfies $\xi_1(s) \in \mathrm{OptSel}(v_i, s)$. It follows that for
all states $s \in S$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$w_i(s) \ge \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_{i+1}}(s, a_2)(t).$$

It follows that the maximal probability with which player 2 can reach $T$ against the strategy $\overline{\gamma}_{i+1}$
is at most $w_i$. It follows that $v_i(s) \le v_{i+1}(s)$.

We now argue that for some state $s \in U$ we have $v_{i+1}(s) > v_i(s)$. Given the strategy $\overline{\gamma}_{i+1}$,
consider a pure memoryless counter-optimal strategy $\pi_2$ for player 2 to reach $T$. Since the selectors
$\gamma_{i+1}(s)$ at states $s \in U$ are obtained from the almost-sure strategy $\overline{\pi}$ in the turn-based game $G_{v_i}$
to satisfy $\mathrm{Safe}(F)$, it follows from Lemma 2 that if for every state $s \in U$, the action $\pi_2(s) \in
\mathrm{OptSelCount}(v_i, s, \gamma_{i+1})$, then from all states $s \in U$, the game stays safe in $F$ with probability 1.
Since $\overline{\gamma}_{i+1}$ is a given strategy for player 1, and $\pi_2$ is counter-optimal against $\overline{\gamma}_{i+1}$, this would
imply that $U \subseteq \{s \in S \mid \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) = 1\}$. This would contradict that $W_1 = \{s \in S \mid
\langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) = 1\}$ and $U \cap W_1 = \emptyset$. It follows that for some state $s^* \in U$ we have
$\pi_2(s^*) \notin \mathrm{OptSelCount}(v_i, s^*, \gamma_{i+1})$, and since $\gamma_{i+1}(s^*) \in \mathrm{OptSel}(v_i, s^*)$ we have

$$v_i(s^*) < \sum_{t \in S} v_i(t) \cdot \delta_{\gamma_{i+1}}(s^*, \pi_2(s^*))(t);$$

in other words, we have

$$w_i(s^*) > \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_{i+1}}(s^*, \pi_2(s^*))(t).$$

Define a valuation $z$ as follows: $z(s) = w_i(s)$ for $s \neq s^*$, and $z(s^*) = \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_{i+1}}(s^*, \pi_2(s^*))(t)$.
Hence $z \le w_i$ with strict inequality at $s^*$, and given the strategy $\overline{\gamma}_{i+1}$ and the counter-optimal strategy $\pi_2$, the valuation $z$
satisfies the inequalities of the linear program for reachability to $T$. It follows that the probability
to reach $T$ given $\overline{\gamma}_{i+1}$ is at most $z$. Since $z \le w_i$ and $z(s^*) < w_i(s^*)$, it follows that $v_{i+1}(s) \ge v_i(s)$ for all $s \in S$, and
$v_{i+1}(s^*) > v_i(s^*)$. This concludes the proof. ∎

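From Lemma 1 and Lemma 3 we obtain the following theorem, which shows that the sequence
of values we obtain is monotonically non-decreasing.

Theorem 3 (Monotonicity of values) For $i \ge 0$, let $\gamma_i$ and $\gamma_{i+1}$ be the player-1 selectors
obtained at iterations $i$ and $i + 1$ of Algorithm 1. If $\gamma_i \neq \gamma_{i+1}$, then
$\langle\langle 1 \rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Safe}(F)) < \langle\langle 1 \rangle\rangle_{val}^{\overline{\gamma}_{i+1}}(\mathrm{Safe}(F))$.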

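Theorem 4 (Optimality on termination) Let $v_i$ be the valuation at iteration $i$ of Algorithm 1
such that $v_i = \langle\langle 1 \rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Safe}(F))$. If $I = \{s \in S \setminus (W_1 \cup T) \mid Pre_1(v_i)(s) > v_i(s)\} = \emptyset$, and
$(A_i \cap S) \setminus W_1 = \emptyset$, then $\overline{\gamma}_i$ is an optimal strategy and $v_i = \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))$.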

Proof. We show that for all memoryless strategies $\pi_1$ for player 1 we have $\langle\langle 1 \rangle\rangle_{val}^{\pi_1}(\mathrm{Safe}(F)) \le v_i$.
Since memoryless optimal strategies exist for concurrent games with safety objectives (Theorem 1),
the desired result follows.

Let $\overline{\pi}_2$ be a pure memoryless optimal strategy for player 2 in $G_{v_i}$ for the objective complementary
to $\mathrm{Safe}(F)$, where $(G_{v_i}, \mathrm{Safe}(F)) = \mathrm{TB}(G, v_i, F)$. Consider a memoryless strategy $\pi_1$ for player 1;
we define a pure memoryless strategy $\pi_2$ for player 2 as follows.

1. If $\pi_1(s) \notin \mathrm{OptSel}(v_i, s)$, then $\pi_2(s) = b \in \Gamma_2(s)$, such that $Pre_{\pi_1(s), b}(v_i)(s) < v_i(s)$ (such a $b$
exists since $\pi_1(s) \notin \mathrm{OptSel}(v_i, s)$).

2. If $\pi_1(s) \in \mathrm{OptSel}(v_i, s)$, then let $A = \mathrm{Supp}(\pi_1(s))$, and consider $B$ such that $B =
\mathrm{OptSelCount}(v_i, s, \pi_1(s))$. Then we have $\pi_2(s) = b$, such that $\overline{\pi}_2((s, A, B)) = (s, A, b)$.

Observe that by construction of $\pi_2$, for all $s \in S \setminus (W_1 \cup T)$, we have $Pre_{\pi_1(s), \pi_2(s)}(v_i)(s) \le v_i(s)$. We
first show that in the Markov chain obtained by fixing $\pi_1$ and $\pi_2$ in $G$, there is no closed connected
recurrent set of states $C$ such that $C \subseteq S \setminus (W_1 \cup T)$. Assume towards contradiction that $C$ is a
closed connected recurrent set of states in $S \setminus (W_1 \cup T)$. The following case analysis achieves the
contradiction.

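1. Suppose for every state $s \in C$ we have $\pi_1(s) \in \mathrm{OptSel}(v_i, s)$. Then consider the strategy
$\overline{\pi}_1$ in $G_{v_i}$ such that for a state $s \in C$ we have $\overline{\pi}_1(s) = (s, A, B)$, where $\mathrm{Supp}(\pi_1(s)) = A$, and
$B = \mathrm{OptSelCount}(v_i, s, \pi_1(s))$. Since $C$ is a closed connected recurrent set of states, it follows by
construction that for all states $s \in C$ in the game $G_{v_i}$ we have $\Pr_s^{\overline{\pi}_1, \overline{\pi}_2}(\mathrm{Safe}(\overline{C})) = 1$, where
$\overline{C} = C \cup \{(s, A, B) \mid s \in C\} \cup \{(s, A, b) \mid s \in C\}$. It follows that for all $s \in C$ in $G_{v_i}$ we have
$\Pr_s^{\overline{\pi}_1, \overline{\pi}_2}(\mathrm{Safe}(F)) = 1$. Since $\overline{\pi}_2$ is an optimal strategy, it follows that $C \subseteq (A_i \cap S) \setminus W_1$. This
contradicts that $(A_i \cap S) \setminus W_1 = \emptyset$.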

2. Otherwise for some state $s^* \in C$ we have $\pi_1(s^*) \notin \mathrm{OptSel}(v_i, s^*)$. Let $r = \min\{q \mid U_q(v_i) \cap C \neq
\emptyset\}$, i.e., $r$ is the least value-class with non-empty intersection with $C$. Hence it follows
that for all $q < r$, we have $U_q(v_i) \cap C = \emptyset$. Observe that since for all $s \in C$ we have
$Pre_{\pi_1(s), \pi_2(s)}(v_i)(s) \le v_i(s)$, it follows that for all $s \in U_r(v_i)$ either (a) $\mathrm{Dest}(s, \pi_1(s), \pi_2(s)) \subseteq
U_r(v_i)$; or (b) $\mathrm{Dest}(s, \pi_1(s), \pi_2(s)) \cap U_q(v_i) \neq \emptyset$, for some $q < r$. Since $U_r(v_i)$ is the least
value-class with non-empty intersection with $C$, it follows that for all $s \in U_r(v_i)$ we have
$\mathrm{Dest}(s, \pi_1(s), \pi_2(s)) \subseteq U_r(v_i)$. It follows that $C \subseteq U_r(v_i)$. Consider the state $s^* \in C$ such that
$\pi_1(s^*) \notin \mathrm{OptSel}(v_i, s^*)$. By the construction of $\pi_2(s^*)$, we have $Pre_{\pi_1(s^*), \pi_2(s^*)}(v_i)(s^*) < v_i(s^*)$.
Hence we must have $\mathrm{Dest}(s^*, \pi_1(s^*), \pi_2(s^*)) \cap U_q(v_i) \neq \emptyset$, for some $q < r$. Thus we have a
contradiction.

It follows from the above that there is no closed connected recurrent set of states in $S \setminus (W_1 \cup T)$,
and hence with probability 1 the game reaches $W_1 \cup T$ from all states in $S \setminus (W_1 \cup T)$. Hence
the probability to satisfy $\mathrm{Safe}(F)$ is equal to the probability to reach $W_1$. Since for all states
$s \in S \setminus (W_1 \cup T)$ we have $Pre_{\pi_1(s), \pi_2(s)}(v_i)(s) \le v_i(s)$, it follows that given the strategies $\pi_1$ and
$\pi_2$, the valuation $v_i$ satisfies all the inequalities of the linear program to reach $W_1$. It follows that the
probability to reach $W_1$ from $s$ is at most $v_i(s)$. It follows that for all $s \in S \setminus (W_1 \cup T)$ we have
$\langle\langle 1 \rangle\rangle_{val}^{\pi_1}(\mathrm{Safe}(F))(s) \le v_i(s)$. The result follows. ∎

Convergence. We first observe that since pure memoryless optimal strategies exist for turn-based
stochastic games with safety objectives (Theorem 1), for turn-based stochastic games it
suffices to iterate over pure memoryless selectors. Since the number of pure memoryless strategies
is finite, it follows that for turn-based stochastic games Algorithm 1 always terminates and yields an
optimal strategy. For concurrent games, we will use the result that for $\varepsilon > 0$, there is a $k$-uniform
memoryless strategy that achieves the value of a safety objective within $\varepsilon$. We first define
$k$-uniform memoryless strategies. A selector $\xi$ for player 1 is $k$-uniform if for all $s \in S \setminus (T \cup W_1)$ and
all $a \in \mathrm{Supp}(\xi(s))$ there exist $i, j \in \mathbb{N}$ such that $0 < i \le j \le k$ and $\xi(s)(a) = \frac{i}{j}$, i.e., the moves in
the support are played with probabilities that are multiples of $\frac{1}{\ell}$ with $\ell \le k$.
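As a small illustration of the definition, the following Python sketch tests whether the distribution chosen at a single state is $k$-uniform. It assumes the distribution is given with exact rational probabilities; the function name `is_k_uniform` and the dictionary encoding are hypothetical. The check uses the fact that a probability can be written as $i/j$ with $j \le k$ exactly when its denominator in lowest terms is at most $k$.

```python
from fractions import Fraction

def is_k_uniform(dist, k):
    """dist maps moves to exact probabilities (Fractions) at one state.
    Every move in the support must have probability i/j with 0 < i <= j <= k."""
    support = [p for p in dist.values() if p > 0]
    # Fraction keeps values in lowest terms, so the denominator is the least valid j.
    return sum(support) == 1 and all(p.denominator <= k for p in support)

halves = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
thirds = {"a": Fraction(1, 3), "b": Fraction(2, 3)}
```

Here `halves` is 2-uniform, while `thirds` is 3-uniform but not 2-uniform.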

Lemma 4 For all concurrent game graphs $G$, for all safety objectives $\mathrm{Safe}(F)$, for $F \subseteq S$, for all
$\varepsilon > 0$, there exist $k$-uniform selectors $\xi$ such that $\overline{\xi}$ is an $\varepsilon$-optimal strategy for
$k = \frac{2^{2^{O(n)}}}{\varepsilon}$, where $n = |S|$.

Proof. (Sketch). For a rational $r$, using the results of [11], it can be shown that whether
$\langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) \ge r$ can be expressed in the quantifier-free fragment of the theory of reals.
Then using the formula in the theory of reals and Theorem 13.12 of [1], it can be shown that if
there is a memoryless strategy $\pi_1$ that achieves value at least $r$, then there is a $k$-uniform memoryless
strategy $\pi_1^k$ that achieves value at least $r - \varepsilon$, where $k = \frac{2^{2^{O(n)}}}{\varepsilon}$, for $n = |S|$. ∎

Strategy improvement with $k$-uniform selectors. We first argue that if we restrict Algorithm 1
such that every iteration yields a $k$-uniform selector, then the algorithm terminates. If we
restrict to $k$-uniform selectors, then a concurrent game graph $G$ can be converted to a turn-based
stochastic game graph, where player 1 first chooses a $k$-uniform selector, then player 2 chooses
an action, and then the transition is determined by the chosen $k$-uniform selector of player 1, the
action of player 2 and the transition function $\delta$ of the game graph $G$. Then by termination of
turn-based stochastic games it follows that the algorithm will terminate. Given $k$, let us denote by
$z_i^k$ the valuation of Algorithm 1 at iteration $i$, where the selectors are restricted to be $k$-uniform,
and $v_i$ is the valuation of Algorithm 1 at iteration $i$. Since $v_i$ is obtained without any restriction,
it follows that for all $k > 0$, for all $i \ge 0$, we have $z_i^k \le v_i$. From Lemma 4 it follows that for all
$\varepsilon > 0$, there exist $k > 0$ and $i \ge 0$ such that for all $s$ we have $z_i^k(s) \ge \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) - \varepsilon$.
This gives us the following result.

Theorem 5 (Convergence) Let $v_i$ be the valuation obtained at iteration $i$ of Algorithm 1. Then
the following assertions hold.

1. For all $\varepsilon > 0$, there exists $i$ such that for all $s$ we have $v_i(s) \ge \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) - \varepsilon$.

2. $\lim_{i \to \infty} v_i = \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))$.

Complexity. Algorithm 1 may not terminate in general. We briefly describe the complexity of
every iteration. Given a valuation $v_i$, the computation of $Pre_1(v_i)$ involves the solution of matrix
games with rewards $v_i$ and can be done in polynomial time using linear programming. Given
$v_i$ and $Pre_1(v_i) = v_i$, the sets $\mathrm{OptSel}(v_i, s)$ and $\mathrm{OptSelCount}(v_i, s)$ can be computed by enumerating
the subsets of available actions at $s$ and then using linear programming: for example, to check
$(A, B) \in \mathrm{OptSelCount}(v_i, s)$ it suffices to check that there is a selector $\xi_1$ such that $\xi_1$ is optimal
(i.e., for all actions $b \in \Gamma_2(s)$ we have $Pre_{\xi_1, b}(v_i)(s) \ge v_i(s)$); for all $a \in A$ we have $\xi_1(a) > 0$, and
for all $a \notin A$ we have $\xi_1(a) = 0$; and to check that $B$ is the set of counter-optimal actions we check
that for $b \in B$ we have $Pre_{\xi_1, b}(v_i)(s) = v_i(s)$, and for $b \notin B$ we have $Pre_{\xi_1, b}(v_i)(s) > v_i(s)$. All
the above can be solved by checking feasibility of a set of linear inequalities. Hence $\mathrm{TB}(G, v_i, F)$
can be computed in time polynomial in the size of $G$ and $v_i$ and exponential in the number of moves.
The set of almost-sure winning states in turn-based stochastic games with safety objectives can be
computed in linear time [10].
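The conditions listed above can be made concrete for a single, already given selector. The Python sketch below evaluates $Pre_{\xi_1, b}(v_i)(s)$ for each player-2 action $b$ and reports whether the selector is optimal, its support $A$, and its set $B$ of counter-optimal actions. It does not search for a selector, which is what the linear-programming feasibility check in the text does; the encoding of the transition function at one state as `delta[(a, b)]` and the names `pre_against` and `classify` are assumptions made for illustration.

```python
from fractions import Fraction

def pre_against(xi, b, v, delta):
    """Expected value of v after one step when player 1 plays the
    distribution xi and player 2 plays the action b."""
    return sum(pa * pt * v[t]
               for a, pa in xi.items()
               for t, pt in delta[(a, b)].items())

def classify(xi, moves2, v, delta, value_at_s):
    """Return (is_optimal, A, B) for the selector xi at one state:
    A is the support of xi, B the actions b attaining exactly value_at_s."""
    vals = {b: pre_against(xi, b, v, delta) for b in moves2}
    optimal = all(val >= value_at_s for val in vals.values())
    A = frozenset(a for a, pa in xi.items() if pa > 0)
    B = frozenset(b for b, val in vals.items() if val == value_at_s)
    return optimal, A, B

# Toy state (hypothetical): a matching-pennies step into "win" (v = 1) or "lose" (v = 0).
one, half = Fraction(1), Fraction(1, 2)
v = {"win": one, "lose": Fraction(0)}
delta = {
    ("a1", "b1"): {"win": one}, ("a1", "b2"): {"lose": one},
    ("a2", "b1"): {"lose": one}, ("a2", "b2"): {"win": one},
}
uniform = {"a1": half, "a2": half}
pure = {"a1": one, "a2": Fraction(0)}
result_uniform = classify(uniform, ["b1", "b2"], v, delta, half)
result_pure = classify(pure, ["b1", "b2"], v, delta, half)
```

In this toy state the uniform selector is optimal with both player-2 actions counter-optimal, while the pure selector is not optimal because player 2 can answer with `b2`.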

5 Termination for Approximation and Turn-based Games 

In this section we present termination criteria for strategy improvement algorithms for concurrent
games for $\varepsilon$-approximation, and then present an improved termination condition for turn-based
games.

Termination for concurrent games. A strategy improvement algorithm for reachability games
was presented in [4]. We refer to the algorithm of [4] as the reachability strategy improvement
algorithm. The reachability strategy improvement algorithm is simpler than Algorithm 1: it is
similar to Algorithm 1, but in every iteration only Step 3.2 is executed (and Step 3.3 need not
be executed). Applying the reachability strategy improvement algorithm of [4] for player 2, for a
reachability objective $\mathrm{Reach}(T)$, we obtain a sequence of valuations $(u_i)_{i \ge 0}$ such that (a) $u_{i+1} \ge u_i$;
(b) if $u_{i+1} = u_i$, then $u_i = \langle\langle 2 \rangle\rangle_{val}(\mathrm{Reach}(T))$; and (c) $\lim_{i \to \infty} u_i = \langle\langle 2 \rangle\rangle_{val}(\mathrm{Reach}(T))$. Given a
concurrent game $G$ with $F \subseteq S$ and $T = S \setminus F$, we apply the reachability strategy improvement
algorithm to obtain the sequence of valuations $(u_i)_{i \ge 0}$ as above, and we apply Algorithm 1 to obtain
a sequence of valuations $(v_i)_{i \ge 0}$. The termination criteria are as follows:

1. if for some $i$ we have $u_{i+1} = u_i$, then we have $u_i = \langle\langle 2 \rangle\rangle_{val}(\mathrm{Reach}(T))$, and $1 - u_i =
\langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))$, and we obtain the values of the game;

2. if for some $i$ we have $v_{i+1} = v_i$, then we have $1 - v_i = \langle\langle 2 \rangle\rangle_{val}(\mathrm{Reach}(T))$, and $v_i =
\langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))$, and we obtain the values of the game; and

3. for $\varepsilon > 0$, if for some $i \ge 0$, we have $u_i + v_i \ge 1 - \varepsilon$, then for all $s \in S$ we have $v_i(s) \ge
\langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) - \varepsilon$ and $u_i(s) \ge \langle\langle 2 \rangle\rangle_{val}(\mathrm{Reach}(T))(s) - \varepsilon$ (i.e., the algorithm can stop for
$\varepsilon$-approximation).

Observe that since $(u_i)_{i \ge 0}$ and $(v_i)_{i \ge 0}$ are both monotonically non-decreasing and
$\langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F)) + \langle\langle 2 \rangle\rangle_{val}(\mathrm{Reach}(T)) = 1$, it follows that if $u_i + v_i \ge 1 - \varepsilon$, then for all $j \ge i$ we have $u_i \ge u_j - \varepsilon$ and
$v_i \ge v_j - \varepsilon$. This establishes that $u_i \ge \langle\langle 2 \rangle\rangle_{val}(\mathrm{Reach}(T)) - \varepsilon$ and $v_i \ge \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F)) - \varepsilon$; and
the correctness of the stopping criterion (3) for $\varepsilon$-approximation follows. We also note that instead
of applying the reachability strategy improvement algorithm, a value-iteration algorithm can be
applied for reachability games to obtain a sequence of valuations with properties similar to $(u_i)_{i \ge 0}$,
and the above termination criteria can be applied.
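Criterion 3 can be sketched as a simple check over the two sequences of valuations. In the Python sketch below, the valuations are dictionaries from states to exact rationals, and the function names `bounds_at` and `first_eps_index` are hypothetical. The sequences `us` and `vs` are made-up toy numbers chosen only to exercise the check; they are not computed from any game.

```python
from fractions import Fraction

def bounds_at(u, v):
    """Lower and upper bounds on the safety value at each state:
    v(s) <= value(s) <= 1 - u(s)."""
    return {s: (v[s], 1 - u[s]) for s in v}

def first_eps_index(us, vs, eps):
    """Least i such that u_i(s) + v_i(s) >= 1 - eps at every state, or None."""
    for i, (u, v) in enumerate(zip(us, vs)):
        if all(u[s] + v[s] >= 1 - eps for s in v):
            return i
    return None

# Toy, monotonically non-decreasing sequences for a single state "s".
us = [{"s": Fraction(10, 100)}, {"s": Fraction(30, 100)}, {"s": Fraction(38, 100)}]
vs = [{"s": Fraction(20, 100)}, {"s": Fraction(50, 100)}, {"s": Fraction(59, 100)}]
eps = Fraction(5, 100)
i = first_eps_index(us, vs, eps)
lower, upper = bounds_at(us[i], vs[i])["s"]
```

When the check fires, the gap between the upper bound $1 - u_i(s)$ and the lower bound $v_i(s)$ is at most $\varepsilon$, so either bound is an $\varepsilon$-approximation of the value.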

Theorem 6 Let $G$ be a concurrent game graph with a safety objective $\mathrm{Safe}(F)$. Algorithm 1 and the
reachability strategy improvement algorithm for player 2 for the reachability objective $\mathrm{Reach}(S \setminus F)$
yield sequences of valuations $(v_i)_{i \ge 0}$ and $(u_i)_{i \ge 0}$, respectively, such that (a) for all $i \ge 0$, we have
$v_i \le \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F)) \le 1 - u_i$; and (b) $\lim_{i \to \infty} v_i = \lim_{i \to \infty} (1 - u_i) = \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))$.

Termination for turn-based games. For turn-based stochastic games, both Algorithm 1 and
the reachability strategy improvement algorithm terminate. Each iteration of the reachability
strategy improvement algorithm of [4] is computable in polynomial time, and here we present a
termination guarantee for the reachability strategy improvement algorithm. To apply the reachability
strategy improvement algorithm we assume the objective of player 1 to be a reachability
objective $\mathrm{Reach}(T)$, and the correctness of the algorithm relies on the notion of proper strategies. Let
$W_2 = \{s \in S \mid \langle\langle 1 \rangle\rangle_{val}(\mathrm{Reach}(T))(s) = 0\}$. Then the notion of proper strategies and its properties
are as follows.

Definition 4 (Proper strategies and selectors) A player-1 strategy $\pi_1$ is proper if for all
player-2 strategies $\pi_2$, and for all states $s \in S \setminus (T \cup W_2)$, we have $\Pr_s^{\pi_1, \pi_2}(\mathrm{Reach}(T \cup W_2)) = 1$. A
player-1 selector $\xi_1$ is proper if the memoryless player-1 strategy $\overline{\xi}_1$ is proper.

Lemma 5 ([4]) Given a selector $\xi_1$ for player 1, the memoryless player-1 strategy $\overline{\xi}_1$ is proper iff
for every pure selector $\xi_2$ for player 2, and for all states $s \in S$, we have $\Pr_s^{\overline{\xi}_1, \overline{\xi}_2}(\mathrm{Reach}(T \cup W_2)) = 1$.

The following result follows from the result of [4] specialized for the case of turn-based stochastic 
games. 

Lemma 6 Let $G$ be a turn-based stochastic game with reachability objective $\mathrm{Reach}(T)$ for player 1.
Let $\gamma_0$ be the initial selector, and $\gamma_i$ be the selector obtained at iteration $i$ of the reachability strategy
improvement algorithm. If $\gamma_0$ is a pure, proper selector, then the following assertions hold:

1. for all $i \ge 0$, we have $\gamma_i$ is a pure, proper selector;

2. for all $i \ge 0$, we have $u_{i+1} \ge u_i$, where $u_i = \langle\langle 1 \rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Reach}(T))$ and $u_{i+1} = \langle\langle 1 \rangle\rangle_{val}^{\overline{\gamma}_{i+1}}(\mathrm{Reach}(T))$;
and

3. if $u_{i+1} = u_i$, then $u_i = \langle\langle 1 \rangle\rangle_{val}(\mathrm{Reach}(T))$, and there exists $i$ such that $u_{i+1} = u_i$.

The strategy improvement algorithm of Condon [6] works only for halting games, but the reachability
strategy improvement algorithm works if we start with a pure, proper selector for reachability
games that are not halting. Hence to use the reachability strategy improvement algorithm to
compute values we need to start with a pure, proper selector. We present a procedure to compute a
pure, proper selector, and then present termination bounds (i.e., bounds on $i$ such that $u_{i+1} = u_i$).
The construction of a pure, proper selector is based on the notion of attractors defined below.

Attractor strategy. Let $A_0 = W_2 \cup T$, and for $i \ge 0$ we have

$$A_{i+1} = A_i \cup \{s \in S_1 \cup S_R \mid E(s) \cap A_i \neq \emptyset\} \cup \{s \in S_2 \mid E(s) \subseteq A_i\}.$$

Since for all $s \in S \setminus W_2$ we have $\langle\langle 1 \rangle\rangle_{val}(\mathrm{Reach}(T))(s) > 0$, it follows that from all states in $S \setminus W_2$
player 1 can ensure that $T$ is reached with positive probability. It follows that for some $i \ge 0$ we
have $A_i = S$. The pure attractor selector $\xi^*$ is as follows: for a state $s \in (A_{i+1} \setminus A_i) \cap S_1$ we have
$\xi^*(s)(t) = 1$, where $t \in A_i$ (such a $t$ exists by construction). The pure memoryless strategy $\overline{\xi^*}$
ensures that for all $i \ge 0$, from $A_{i+1}$ the game reaches $A_i$ with positive probability. Hence there
is no end-component $C$ contained in $S \setminus (W_2 \cup T)$ in the MDP $G_{\overline{\xi^*}}$. It follows that $\xi^*$ is a pure
selector that is proper, and the selector $\xi^*$ can be computed in $O(|E|)$ time. This completes the
reachability strategy improvement algorithm for turn-based stochastic games. We now present the
termination bounds.
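The attractor construction can be sketched as a layered fixpoint computation. The Python sketch below is a straightforward round-by-round version, so it runs in time proportional to the number of rounds times $|E|$ rather than the $O(|E|)$ bound stated above, which needs a worklist implementation. The graph encoding (`E[s]` is the set of successors of `s`), the function name `attractor_selector`, and the convention that every state outside `S2` is a player-1 or probabilistic state are assumptions made for illustration.

```python
def attractor_selector(S1, S2, E, start):
    """Compute level[s] = least i with s in A_i, starting from A_0 = start,
    and a pure selector for player-1 states that moves one level down."""
    level = {s: 0 for s in start}
    selector = {}
    i = 0
    changed = True
    while changed:
        changed = False
        current = set(level)              # snapshot of A_i
        for s in set(E) - current:
            succ = E[s]
            if s in S2:
                joins = succ <= current   # player 2 cannot avoid A_i
            else:
                joins = bool(succ & current)  # player 1 or chance can enter A_i
            if joins:
                level[s] = i + 1
                if s in S1:
                    selector[s] = min(succ & current)  # any successor in A_i works
                changed = True
        i += 1
    return level, selector

# Toy graph (hypothetical): "t" is the target, "z" stands for a state of W_2.
E = {"a": {"a", "r"}, "b": {"a", "r"}, "r": {"t", "z"}, "t": {"t"}, "z": {"z"}}
level, selector = attractor_selector({"a"}, {"b"}, E, {"t", "z"})
```

In the toy graph the probabilistic state `r` enters at level 1, the player-1 state `a` at level 2 with the choice `r`, and the player-2 state `b` only at level 3, once all its successors are covered.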

Termination bounds. We present termination bounds for binary turn-based stochastic games. A
turn-based stochastic game is binary if for all $s \in S_R$ we have $|E(s)| \le 2$, and for all $s \in S_R$ if
$|E(s)| = 2$, then for all $t \in E(s)$ we have $\delta(s)(t) = \frac{1}{2}$, i.e., for all probabilistic states there are at
most two successors and the transition function $\delta$ is uniform.

Lemma 7 Let $G$ be a binary Markov chain with $|S|$ states with a reachability objective $\mathrm{Reach}(T)$.
Then for all $s \in S$ we have $\langle\langle 1 \rangle\rangle_{val}(\mathrm{Reach}(T))(s) = \frac{p}{q}$, with $p, q \in \mathbb{N}$ and $p, q \le 4^{|S|-1}$.

Proof. The result follows as a special case of Lemma 2 of [6]. Lemma 2 of [6] holds for halting
turn-based stochastic games, and since a Markov chain reaches the set of closed connected recurrent
states with probability 1 from all states, the result follows. ∎

Lemma 8 Let $G$ be a binary turn-based stochastic game with a reachability objective $\mathrm{Reach}(T)$.
Then for all $s \in S$ we have $\langle\langle 1 \rangle\rangle_{val}(\mathrm{Reach}(T))(s) = \frac{p}{q}$, with $p, q \in \mathbb{N}$ and $p, q \le 4^{|S_R|-1}$.

Proof. Since pure memoryless optimal strategies exist for both players (Theorem 1), we fix pure
memoryless optimal strategies $\pi_1$ and $\pi_2$ for both players. The Markov chain $G_{\pi_1, \pi_2}$ can then be
reduced to an equivalent Markov chain with $|S_R|$ states (since we fix deterministic successors for
states in $S_1 \cup S_2$, they can be collapsed to their successors). The result then follows from Lemma 7. ∎

From Lemma 8 it follows that at iteration $i$ of the reachability strategy improvement algorithm
either the sum of the values increases by at least $\frac{1}{4^{|S_R|-1}}$, or else there is a valuation $u_i$ such that
$u_{i+1} = u_i$. Since the sum of values of all states can be at most $|S|$, it follows that the algorithm
terminates in at most $|S| \cdot 4^{|S_R|-1}$ steps. Moreover, since the number of pure memoryless strategies
is at most $\prod_{s \in S_1} |E(s)|$, the algorithm terminates in at most $\prod_{s \in S_1} |E(s)|$ steps. It follows from
the results of [19] that a turn-based stochastic game graph $G$ can be reduced to an equivalent binary
turn-based stochastic game graph $G'$ such that the sets of player-1 and player-2 states in $G$ and
$G'$ are the same and the number of probabilistic states in $G'$ is $O(|\delta|)$, where $|\delta|$ is the size of the
transition function in $G$. Thus we obtain the following result.

Theorem 7 Let $G$ be a turn-based stochastic game with a reachability objective $\mathrm{Reach}(T)$. Then the
reachability strategy improvement algorithm computes the values in time

$$O\Big(\min\Big\{\prod_{s \in S_1} |E(s)|,\; 2^{O(|\delta|)}\Big\} \cdot \mathrm{poly}(|G|)\Big),$$

where poly is a polynomial function.

The results of [15] presented an algorithm for turn-based stochastic games that works in time
$O(|S_R|! \cdot \mathrm{poly}(|G|))$. The algorithm of [15] works only for turn-based stochastic games; for general
turn-based stochastic games the complexity of the algorithm of [15] is better. However, for
turn-based stochastic games where the transition function at all states can be expressed in constant bits we
have $|\delta| = O(|S_R|)$. In these cases the reachability strategy improvement algorithm (that works for
both concurrent and turn-based stochastic games) works in time $2^{O(|S_R|)} \cdot \mathrm{poly}(|G|)$ as compared to
the time $2^{O(|S_R| \cdot \log(|S_R|))} \cdot \mathrm{poly}(|G|)$ of the algorithm of [15].